Why Distributed SQL is the database, evolved.
Free O'Reilly GuideA recent tweet inspired me to address the need for fuzzy matching by combining some existing capabilities of CockroachDB. Note the key features mentioned in the tweet:
What I found interesting about this exercise is that this pattern of configuring a changefeed to route events through an external system, then back into the database has enormous potential, far beyond the fuzzy matching example that you’ll read about below. It seems that development practices have evolved away from burying logic in the database and towards coding it in the languages most applicable to the problem; the emergence of microservices aligns with this trend. That observation, combined with a new feature of being able to define a changefeed on a specific column family, has me convinced we’ll see some very interesting applications of this pattern with CockroachDB.
Now I’ll get off of my soapbox and into the demo!
To begin with, CockroachDB isn’t a fork of PostgreSQL, so you don’t simply “bolt on” the usual Postgres extensions such as pg_trgm
, the module mentioned in the tweet. But this is a small obstacle and, in fact, offers the opportunity to demonstrate an up-and-coming feature of CockroachDB Enterprise Changefeeds, aka “CDC”: the ability to configure a changefeed on a specific column family. This feature is available in the v22.1.0-beta.1
version I am using here – this release should be generally available by mid-May 2022.
We do INSERT, UPDATE, DELETE on data about sports teams and would like to retrieve this data based on queries where we may misspell names. I confess to having misread the tweet, interpreting it as pertaining to sports team names as opposed to sport events names, so what I show here uses team name data, but I think it’s easy to apply to event names as well, so long as you have access to that data.
If we envision the requirement arises out of an application’s need to perform this type of matching, we can build a simple REST app which will:
Here’s the DDL for the table. The FAMILY
… parts of this enable the changefeed to generate events based solely on FAMILY f1
, ignoring the resulting changes to FAMILY f2
, which is what we need:
CREATE TABLE teams
(
id UUID PRIMARY KEY DEFAULT GEN_RANDOM_UUID()
, name TEXT NOT NULL
, grams TEXT[]
, FAMILY f1 (id, name)
, FAMILY f2 (grams)
);
And here’s the inverted index on the grams
column:
CREATE INDEX ON teams USING GIN (grams);
Now, we need the REST app. I use Python for this since it’s concise and easy to get up and running, and the Flask module works very well for REST apps. And, I had some code fragments hanging around which I could reuse pretty easily. The entire Python script is here, but I’ll focus on interesting aspects below.
def get_ngrams(s, n=3):
return [s[i:i+n] for i in range(len(s) - n+1)]
For the input la galaxy
, the return value is ['la ', 'a g', ' ga', 'gal', 'ala', 'lax', 'axy']
. The default value of n
is 3, which aligns with the way pg_trgm
works.
A couple of things to point out:
@app.route('/cdc', methods = ['POST', 'GET'])
def cdc_webhook():
obj = request.get_json(force=True)
print("CDC: " + json.dumps(obj)) # DEBUG
for o in obj["payload"]:
if o["after"] is None:
pass # Nothing to be done here
else:
pk = o["after"]["id"]
name = o["after"]["name"]
index_string(pk, name)
return "OK", 200
The changefeed will send an HTTP POST to this /cdc
endpoint, as configured via the SQL statement shown below. When that happens, JSON is accessible by the call to request.get_json(force=True)
, and that JSON contains the current values of the id
and name
columns from the teams
table.
index_string(pk, name)
function, which just updates the grams
column of the teams
table:def index_string(pk, content):
ng = tokenize(content)
stmt = text("UPDATE teams SET grams = :grams WHERE id = :pk").bindparams(grams=ng, pk=pk)
run_statement(stmt)
I won’t get into the run_statement(stmt)
function here, but will instead refer interested readers to the code itself.
CREATE CHANGEFEED FOR TABLE teams FAMILY "f1"
INTO 'webhook-https://localhost:18080/cdc?insecure_tls_skip_verify=true'
WITH updated, full_table_name, topic_in_value;
Note that this operation requires an Enterprise License, but is also available in the single node “demo” mode (see below).
A REST client can retrieve fuzzy matches for a given team name by doing something equivalent to this (this example was run on my MacBook; the base64
command works differently here than it does on Linux; pretty_print_json.py
is here):
$ name="PA Galuxy"; time curl -k -s https://localhost:18080/search/$( echo -n $name | base64 )/5 | pretty_print_json.py
[
{
"name": "LA Galaxy",
"pk": "15e240a7-d1db-4b77-b454-c895a11610bf",
"score": "42.8571"
},
{
"name": "LA Galaxy II",
"pk": "85b5b97a-9a6e-4cef-b63c-7cbe123eca07",
"score": "10.7143"
},
{
"name": "LA Giltinis",
"pk": "82b96763-6836-4ab8-84ad-1864a1f3e16d",
"score": "4.7619"
},
{
"name": "Tampa Mayhem",
"pk": "6f5fd5e0-6654-4385-9bd0-c191f4f1c5b4",
"score": "3.5714"
},
{
"name": "Tampa Tarpons",
"pk": "1851c8e9-77e8-456b-a197-6aaed971942a",
"score": "2.8571"
}
]
real 0m0.196s
user 0m0.063s
sys 0m0.043s
Before getting into the Python code for the /search
endpoint, the following are worthy of mention:
On to the Python part of this interaction:
@app.route("/search/<q_base_64>/<int:limit>")
def do_search(q_base_64, limit):
query_str = decode(q_base_64)
logging.info("Query: {}".format(query_str))
ng = tokenize(query_str)
logging.info("Query (n-grams): {}".format(ng))
sql = """
WITH qbool AS
(
SELECT id, grams, 1 + ABS(ARRAY_LENGTH(grams, 1) - ARRAY_LENGTH(CAST(:ngrams AS TEXT[]), 1)) delta
FROM teams
WHERE grams && CAST(:ngrams AS TEXT[])
), qscore AS
(
SELECT id, COUNT(*) n FROM
(
SELECT id, UNNEST(grams) FROM qbool
INTERSECT
SELECT id, UNNEST(CAST(:ngrams AS TEXT[])) FROM qbool
)
GROUP BY id
)
SELECT qbool.id, t.name, 100*n/delta score
FROM qbool, qscore, teams t
WHERE qbool.id = qscore.id AND t.id = qbool.id
ORDER BY score DESC
LIMIT :max_rows;
"""
stmt = text(sql).bindparams(ngrams=ng, max_rows=limit)
rv = []
for row in run_statement(stmt, True, False):
pk = str(row[0])
name = str(row[1])
score = float(row[2]/len(ng))
d = {}
(d["pk"], d["name"], d["score"]) = (pk, name, '{:.4f}'.format(score))
rv.append(d)
return Response(json.dumps(rv), status=200, mimetype="application/json")
There’s a fair bit going on here, mostly within the SQL expression. I’ll do my best to narrate that:
qbool
handles the boolean nature: is this row a match or not? It also provides an additional scoring input, which is the difference in the length of the provided query string and the actual value in the grams
column. This is used to, for example, boost the score for the row containing “LA Galaxy” relative to a row containing “LA Galaxy II”.WHERE grams && CAST(:ngrams AS TEXT[])
, incorporates the &&
(array overlap) operator. Ideally, this operation would use the GIN index; this pull request addresses that, has been merged, so is currently available on main. It will be included in a future release – stay tuned.qscore
just uses the results of qbool
to determine a score based on the number of overlapping n-grams between the matching row and the query.name
column from the teams
table to generate an ordered result to return to the client in JSON format (see the above example).Here are some notes on how to replicate what I’ve shown above. Given I’ve already done this, it’s possible I’ll leave out some steps, so please let me know (GitHub issue is probably the simplest way). I’m going to illustrate this using an Ubuntu VM I’ve just deployed in Google Cloud Platform.
psql
CLI):sudo apt update
sudo apt install postgresql-client-common -y
sudo apt install postgresql-client -y
sudo apt install python3-pip -y
sudo apt install libpq-dev -y
pip3 install -r requirements.txt
$ curl https://binaries.cockroachdb.com/cockroach-v22.1.0-beta.1.linux-amd64.tgz | tar xzvf -
$ ./cockroach-v22.1.0-beta.1.linux-amd64/cockroach demo
In the output, you’ll see a line like the following, which you’ll use in a couple of contexts:
# (sql)
postgresql://demo:demo15932@127.0.0.1:26257/movr?sslmode=require
$ git clone https://github.com/mgoddard/hot-fuzz.git
hot-fuzz
):$ cd hot-fuzz/
$ ls
LICENSE README.md images prep_teams_data.pl pretty_print_json.py trigrams.py
$ export DB_CONN_STR="postgresql://demo:demo15932@127.0.0.1:26257/movr?sslmode=require"
$ ./trigrams.py
demo@127.0.0.1:26257/movr> CREATE TABLE teams
(
id UUID PRIMARY KEY DEFAULT GEN_RANDOM_UUID()
, name TEXT NOT NULL
, grams TEXT[]
, FAMILY f1 (id, name)
, FAMILY f2 (grams)
);
CREATE INDEX ON teams USING GIN (grams);
demo@127.0.0.1:26257/movr> SET CLUSTER SETTING kv.rangefeed.enabled = TRUE;
demo@127.0.0.1:26257/movr> CREATE CHANGEFEED FOR TABLE teams family "f1"
INTO 'webhook-https://localhost:18080/cdc?insecure_tls_skip_verify=true'
WITH updated, full_table_name, topic_in_value;
$ cd hot-fuzz/
$ DB_CONN_STR="postgresql://demo:demo15932@127.0.0.1:26257/movr?sslmode=require"
$ curl -s https://en.wikipedia.org/wiki/List_of_professional_sports_teams_in_the_United_States_and_Canada | ./prep_teams_data.pl | psql $DB_CONN_STR
$ name="PA Galuxy"; time curl -k -s https://localhost:18080/search/$( echo -n $name | base64 )/5 | ./pretty_print_json.py
[
{
"name": "LA Galaxy",
"pk": "73b541c1-e3d4-4f1d-8598-bedc11200f03",
"score": "42.8571"
},
{
"name": "LA Galaxy II",
"pk": "3a3b0e44-68cc-45da-aa72-9467e46f0458",
"score": "10.7143"
},
{
"name": "LA Galaxy II",
"pk": "9303e054-d6ad-4793-b904-9e3eeba28ff9",
"score": "10.7143"
},
{
"name": "LA Giltinis",
"pk": "9b8bfbb4-1ac6-477d-aaed-79daf1183f02",
"score": "4.7619"
},
{
"name": "Tampa Mayhem",
"pk": "f566f12f-7f8d-4ae6-b04b-832a5f9025c5",
"score": "3.5714"
}
]
real 0m0.049s
user 0m0.052s
sys 0m0.008s
If you’d like to
The author wishes to thank the following individuals for providing valuable input which made this blog post possible:
@schrepfler
, for tagging @CockroachDB
in the tweet about n-grams and fuzzy matchingImagine you’re driving a rental car in Rome and the satnav (or GPS) on your phone stops working. This happened to me two …
Read moreAs a developer, you are often faced with the question: why is my application slow? You may have already identified the …
Read more
CockroachDB’s support for SQLAlchemy is currently in beta, but we’re actively developing new …
Read more